Abstract

Background: Professionalism is a difficult construct to define in medical students but aspects of this concept may 
be important in predicting the risk of postgraduate misconduct. For this reason attempts are being made to 
evaluate medical students' professionalism. This study investigated the psychometric properties of Selected 
Response Questions (SRQs) relating to the theme of professional conduct and ethics, comparing them with two
sets of control items: items testing pure knowledge of anatomy, and items evaluating the ability to integrate and
apply knowledge ("skills"). The performance of students on the SRQs was also compared with two external
measures estimating aspects of professionalism in students: peer ratings of professionalism and their
Conscientiousness Index, an objective measure of behaviours at medical school.

Methods: Item Response Theory (IRT) was used to analyse both question and student performance for SRQs 
relating to knowledge of professionalism, pure anatomy and skills. The relative difficulties, discrimination and 
'guessabilities' of each theme of question were compared with each other using Analysis of Variance (ANOVA). 
Student performance on each topic was compared with the measures of conscientiousness and professionalism 
using parametric and non-parametric tests as appropriate. A post-hoc analysis of power for the IRT modelling was 
conducted using a Monte Carlo simulation. 

Results: Professionalism items were less difficult than the anatomy and skills SRQs, poorer at discriminating
between candidates and more erratically answered when compared to anatomy questions. Moreover,
professionalism item performance was uncorrelated with the standardised Conscientiousness Index scores (rho =
0.009, p = 0.90). In contrast there were modest but significant correlations between standardised Conscientiousness
Index scores and performance on anatomy items (rho = 0.20, p = 0.006) though not skills items (rho = .11, p = .1).
Likewise, students with high peer ratings for professionalism had superior performance on anatomy SRQs but not 
professionalism themed questions. A trend of borderline significance (p = .07) was observed for performance on 
skills SRQs and professionalism nomination status. 

Conclusions: SRQs related to professionalism are likely to have relatively poor psychometric properties and lack 
associations with other constructs associated with undergraduate professional behaviour. The findings suggest that 
such questions should not be included in undergraduate examinations and may raise issues with the introduction 
of Situational Judgement Tests into Foundation Years selection. 
Background

Promoting professionalism may be at once the most 
important and least successful aspect of medical training 
with well documented challenges in both defining [1] and
assessing the construct [2]. Moreover, professionalism is a
highly culture-bound construct and may vary according to
the stage of educational development [3]. It is also unclear
whether professionalism is a learned [4] or acquired
characteristic. A recent study indicated that doctors subject
to completed disciplinary action were more likely to be men,
to be of lower estimated social class, and to have had academic
difficulties during their medical course, especially in the early
years [5]. At least two of these three features are not
attributable to the teaching of professionalism. If this proves
indeed to be the case, a student could only be selected on 
the basis of professionalism, not taught it. In any event, 
the accuracy of evaluation of professionalism in medical 
students has implications for patient safety as well as indi- 
vidual development. 

It is an assumption that professionalism has to be 
defined before it can be taught or measured. This may 
not be true: expert connoisseurship [6] can recognise 
situations which cannot be defined, just as a connoisseur 
may be able to recognise the quality of a new whisky 
without a checklist. Unsurprisingly there is no consensus
on how to measure professionalism in undergraduates.
Wilkinson has recently categorised five major
themes in measuring professionalism [2]. These can be
summarised as adherence to ethical practice principles, 
effective interactions with patients and their significant 
others, effective interactions with other health profes- 
sionals, reliability, and commitment to competence. 
However, approaches to assessing professionalism have 
usually focussed on subjective decisions by those who 
have observed the candidate in action. Such measures 
are of low reliability, in that person-person interactions 
are strong, and the phenomenon of 'failure to fail' may 
apply, with assessors reluctant to dispense less than a 
pass grade [7]. This may be for a variety of reasons: the
assessor may lack confidence in the assessment method;
they may have formed a bond with the assessee; or they
may just regard it as likely to cause too much trouble.
Such assessments may have low validity, in that only the
behaviour under test is scored, but attract a high economic
cost given that such decisions are often made by
senior clinicians. An attempt to mimic the reliability of
the Mini-CEX has been pursued through the development
of the Professionalism Mini-Examination (PMEX) [8].
This uses a scoring pad for observation of undergradu- 
ates in training. However, this instrument still suffers 
from the problems of a limited number of observations, 
'failure to fail', and person-person interaction. 

In the Durham University Medical Programme the
measurement of diligence or conscientiousness has been
explored as an index, or at least one component, of
professionalism (the Conscientiousness Index) [9,10]. We
have been able to demonstrate that there is a relationship
between measures of conscientiousness in routine
tasks and estimates of professionalism
made independently by faculty and student peers. The
measure also appears to have good reliability, and
Conscientiousness Index scores were found to be significantly
(p <.05) inversely correlated with the number
of nominations for "least professional" individual by
other students within their peer groups [9,10]. While
this concurrent validity evidence is not of the same
value as predictive validity evidence, it is nonetheless
interesting as validated, objective, reliable and scalar
information on professionalism in undergraduate medical
students. This gives us an opportunity to explore
relationships between the Conscientiousness Index and 
other potential measures of professionalism which may 
be used as predictors of future performance. 

One US-based study reported a negative association 
between assessment performance during internships and 
the likelihood of referral for disciplinary action in later 
medical careers [11]. Moreover, the authors reported a 
positive relationship between evidence of poor profes- 
sionalism ratings during internships and the likelihood 
of referral for disciplinary action in later medical 
careers. Both findings could be explained by conscien- 
tiousness acting as a mediator between assessment per- 
formance and professionalism ratings. This issue is 
particularly relevant at present, since it is proposed that 
Situational Judgement Tests (SJTs) will be used for the 
high stakes selection of candidates for Foundation places 
in the UK from 2012. These SJTs themselves take the 
form of Selected Response Questions (SRQs) whereby a 
candidate is offered a short written vignette concerning 
a complex work situation and selects one or more of 
the most appropriate professional responses from a list 
of responses [12]. There is evidence that these are posi- 
tive predictors for workplace performance [13]. How- 
ever, they have not been tested with regard to 
undergraduate performance. In addition, SJTs and 
knowledge-based tests are currently used as measures of 
performance with regard to professionalism in some 
undergraduate curricula. The argument could be made 
that, although knowledge of the ethical course of action 
is not evidence of an intention to act ethically, it is an 
essential prerequisite. We have therefore analysed 
undergraduate student performance on SRQs in com- 
parison with their conscientiousness and peer ratings of 
professionalism. Our aim was to evaluate whether there 
was any evidence to support the use of SRQs when evaluating
professionalism. The primary objective was to
compare both item and student performance on
SRQs concerned with professional behaviour with two
other types of control question. Thus, our hypothesis
was that ratings of professionalism and conscientiousness
would be more strongly associated with performance
on SRQs probing knowledge of professional
conduct compared to other types of item.

Methods 

Study Design 

A cross-sectional survey design was utilised in order to 
examine the relationship between the variables under 
study. Data from two consecutive cohorts of medical 
undergraduates during their first two years at medical 
school were utilised. 

Data collection 

There were 96 students in the first cohort and 98 in the 
second. Examination results were available from four 
examinations for the first cohort and three examinations 
for the second. The SRQ-based examinations conducted
in the first two years consisted of multiple choice question
(MCQ) items, where a single best answer was selected
from a choice of five responses, and Extended Matching
Questions (EMQs). In the case of EMQs each item has
a themed list of possible responses and multiple questions
linked to this, with the candidate aiming to match
a response to each question. At Durham University the
examinations conducted in years I and II cover a wide 
range of topics including immunology, microbiology, 
anatomy (pure and applied), medical ethics and physiol- 
ogy. In turn, items are allocated to three main cate- 
gories: 'Knowledge and Critical Thinking'; 'Skills'; and 
'Professional Behaviours'. The Professional Behaviours 
domain includes both reflective writing and understand- 
ing of how to behave professionally. Within the four 
examinations the first cohort answered 14 MCQs and 
25 EMQs (i.e. 5 sets of response lists) on professional- 
ism. The second cohort answered eight MCQs and 20 
EMQs (i.e. 4 response lists) on this domain. An example 
of a professionalism MCQ item would be: "You are on
your Community Placement which offers bereavement 
counselling. In one session the placement worker, who 
you are shadowing, deals harshly with a crying client. 
This has never happened before. Do you: 

a) challenge the placement worker in front of the 
client? 

b) pretend it didn't happen and say/do nothing?

c) take over the counselling session yourself?

d) confront the placement worker afterwards in 
private? 

e) report the placement worker to his/her superior?" 

In order to explore the properties of the professionalism
items they were compared to the responses to
questions relating to "pure" (as opposed to applied)
knowledge of anatomy. This theme was selected to serve 
as a control set of items as it was hypothesised that 
acquisition of anatomical knowledge was more likely to 
require conscientious study than knowledge of profes- 
sionalism. The first cohort answered 22 MCQs and 55 
EMQs and the second cohort answered seven MCQs 
and 25 EMQs on anatomy. A third set of SRQs, taken
from the 'Skills' category of question, was also included
as an alternative comparison group. These items were
designed to test the skill of drawing on knowledge
(sometimes from different topics) and applying the
information to clinical problems. An example of a skills
themed SRQ would be: "Radiological imaging is commonly
used in the investigation of the hepatobiliary and
GI tract. Which of the following statements is true when 
a clinician is considering what type of image to request? 

a) The skill of the operator is paramount in obtain- 
ing a plain abdominal film 

b) Fluoroscopy cannot demonstrate oesophageal 
motility 

c) A double contrast barium enema will rarely visua- 
lise the caecum 

d) Magnetic resonance imaging exposes the patient to 
considerably less ionising radiation than compu- 
terised tomography 

e) Endoscopic retrograde pancreatography is very use- 
ful to assess pancreatic function" 

Responses from the first cohort to 24 MCQs and 29 
EMQs from the skills category were analysed. For the 
second cohort 16 MCQs and 5 EMQs relating to skills 
were utilised. The responses to these items were ana- 
lysed using a Rasch analysis (see below) in order to gen- 
erate an interval metric of estimated student ability in 
relation to knowledge of professionalism, anatomy and 
skills. 

Data relating to the Conscientiousness Index was also 
available for each student as a percentage of the total 
"conscientiousness points" available. This measure relies 
on objective information such as attendance at teaching 
sessions and compliance with administrative tasks such 
as submission of immunisation documentation [10]. In 
order to compare the two cohorts accurately the Con- 
scientiousness Index percentages were converted to 
standardised z scores. In addition, information was avail- 
able relating to peer nominations for professionalism. 
This approach has been previously shown to detect 
"extremes" and those students who have received a high 
number of nominations for being perceived as "least 
professional" had, on average, lower Conscientiousness 
Index scores [9]. In the first cohort peer nominations
were conducted within the year group. In order to
increase participation, for the subsequent cohort, peer
assessment was conducted within tutor groups. This
change was made because students had reported they
felt it was easier to make accurate nominations within a
tutor group where there was more familiarity with
peers, rather than within a year group. For both year
groups nominations were converted into an aggregate
score of professionalism by subtracting nominations for
least professional from those for most professional.
Cut-offs were generated in order to identify the top 10% and
bottom 10% of aggregate scores within each year group. 
Thus students were categorised as having peer profes- 
sionalism ratings that were high, low or neither. 
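
The two data-preparation steps described above (standardising the Conscientiousness Index within each cohort, and turning peer nominations into high/low/neither categories) can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function names, the use of the population standard deviation, and the handling of tied aggregate scores at the cut-offs are all assumptions.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise raw scores within one cohort (mean 0, SD 1).

    Assumption: the population SD is used; the paper does not say which.
    """
    m, sd = mean(values), pstdev(values)
    return [(v - m) / sd for v in values]

def peer_category(most, least, fraction=0.10):
    """Label each student 'high', 'low' or 'neither'.

    most / least are per-student counts of 'most professional' and
    'least professional' nominations within one year group.
    Assumption: students tied with a cut-off score share its label.
    """
    aggregate = [m - l for m, l in zip(most, least)]
    ranked = sorted(aggregate)
    k = max(1, round(fraction * len(ranked)))    # size of each 10% tail
    low_cut, high_cut = ranked[k - 1], ranked[-k]
    labels = []
    for a in aggregate:
        if a >= high_cut and a > low_cut:
            labels.append("high")
        elif a <= low_cut and a < high_cut:
            labels.append("low")
        else:
            labels.append("neither")
    return labels
```

Applying `z_scores` separately to each cohort's percentages is what allows the two cohorts to be pooled on a common scale.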

Item response modelling 

Item response theory (IRT) modelling is based on
a modified form of factor analysis of binary and categorical
data. Within the family of IRT models Rasch analysis
was originally developed for the exploration of dichoto- 
mous responses to test items [14]. Rasch analysis can be 
used to create interval metrics of both item difficulty 
and respondent ability from ordinal (ordered categori- 
cal) or binary (dichotomous) response data. The Rasch 
model assumes that all items are identical in terms of 
their ability to discriminate between respondents 
according to ability (i.e. equality of item factor loadings 
in classical factor analytic terms). Nevertheless, Rasch 
software is able to provide simulated estimates of other 
parameters aside from difficulty and ability such as the 
degree of discrimination an item provides in determin- 
ing the level of the underlying trait in a respondent. In 
addition, an estimated value for a lower asymptote is 
provided which represents an index of "guessing". Nor- 
mally these latter values are estimated using the less 
constrained two and three parameter (2-PL, 3-PL) logis- 
tic models rather than the Rasch model. The WIN- 
STEPS programme is able to provide indices of these 
parameters which are purported to be as accurate as 
those provided by less constrained models [15-17]. In a
Rasch analysis reliability can be appraised in a number
of ways: the person reliability coefficient relates to the
replicability of the ranking of abilities while the person 
separation index represents the signal to noise ratio and 
estimates the ability of a test to reliably differentiate dif- 
ferent levels of ability within a cohort [18]. A descrip- 
tion of IRT and its potential application in a medical 
education setting has been previously published [19]. 
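
For readers less familiar with these models, the relationship between the parameters discussed above can be written as a single response function. The sketch below uses the standard three-parameter logistic form; the function and argument names are illustrative choices, and it is not a description of how WINSTEPS derives its indices.

```python
import math

def p_correct(theta, b, a=1.0, c=0.0):
    """Probability of a correct response under the 3-PL logistic model.

    theta: person ability (logits)
    b:     item difficulty (logits)
    a:     item discrimination
    c:     lower asymptote (the "guessing" parameter)
    With a = 1 and c = 0 this reduces to the dichotomous Rasch model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))
```

Under the Rasch constraints a candidate of average ability (theta = 0) has a 50% chance of answering an item of difficulty 0 logits, and roughly a 75% chance on an item of difficulty -1.11 logits. Setting `c=0.2`, as might be assumed for pure guessing among five options, raises the first figure to 60%.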

The Rasch analysis was conducted in two ways. Firstly, 
to construct interval measures of performance at each 
type of question, the items of each type were pooled 
and analysed by cohort. For example, for performance 
on professionalism items included in the first cohort's 
examinations all responses to items relating to this 
theme were pooled across exams and Rasch analysed as
a batch. Estimates of ability were derived for both MCQ
and EMQ format items in order to evaluate whether the 
two types of items should be combined. Reliable test- 
equating between examinations sat by different cohorts 
was not possible as there were no common items 
included. For this reason ability estimates on the three 
domains (skills, anatomy and professionalism) were standardised
as z scores for each cohort. Secondly, the relative
item characteristics for each theme (skills, anatomy
and professionalism) were compared by performing a
Rasch analysis separately for each exam. 

The Rasch model assumes local independence (i.e. 
there should be no correlation between responses once 
the "Rasch dimension" has been controlled for). If this 
assumption is violated then values such as ability and 
person separation estimates may be overestimated. In 
the case of EMQs the item responses are related to the 
same stem. Thus, there was a risk that this assumption 
of local independence would not hold either because the 
response related to a particular area of specialised 
knowledge or the stem question posed was asked in a 
particular way (i.e. a method effect). For this reason we 
examined the data for evidence of systematic non-inde- 
pendence in the responses as evidenced by correlated 
residuals between responses to EMQs relating to the 
same stem. There were surprisingly few relatively large 
(i.e. > 0.3) correlated residuals observed, with most sets 
of items having one or no pairs of correlated residuals 
present (in some cases these were not even between 
items relating to the same stem). The effect of such 
local dependency was evaluated using the method 
recommended by Linacre [16]. Firstly "testlets" of locally 
dependent items were produced by summing their 
responses. The model was then re-estimated using the 
partial credit Rasch model (which accommodates more 
than two response categories). The new person
ability estimates recovered from the model were then
cross-plotted against the original values obtained and
examined for evidence of change. No obvious changes
in the estimates were noted, the only exception being 
the anatomy SRQs completed by the first cohort. In this 
case eight of the 33 items were found to have correlated 
residuals (seven of which were related to the same 
stem). When testlets of items with correlated residuals 
were constructed and entered into the model around 
24% of the anatomy ability estimates (relating to 23 stu- 
dents) for that cohort markedly changed (i.e. departed 
from the diagonal of the cross-plot). For this reason per- 
formance on the anatomy SRQs for the first cohort was 
estimated using this method. Likewise, the person 
separation index for anatomy was calculated on the 
basis of this analysis using testlets. 
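
The local-dependence check described above can be illustrated with a short sketch: compute standardised residuals for each item under the Rasch model, correlate the residuals of item pairs, and sum any locally dependent items into a single polytomous "testlet" score. The helper names are hypothetical, and person abilities and item difficulties are assumed to be already estimated; the re-estimation with a partial credit model is not shown.

```python
import math

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def std_residuals(responses, thetas, b):
    """Standardised residuals (observed minus expected) for one item."""
    out = []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        out.append((x - p) / math.sqrt(p * (1.0 - p)))
    return out

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def make_testlet(item_columns):
    """Sum dichotomous scores per person across locally dependent items."""
    return [sum(scores) for scores in zip(*item_columns)]
```

A pair of items would be flagged when `pearson` of their residual vectors exceeds 0.3, the threshold quoted in the text, and the flagged items would then be passed to `make_testlet` before the model is re-estimated.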

Of the 1,064 items evaluated from the seven exams,
ten had not been scored due to problems with wording/
ambiguity that were discovered after administration. A
further 21 items were answered correctly by every stu- 
dent and therefore did not provide any information. 
When comparing the professionalism items with those 
of other themes such items that had been answered cor- 
rectly in all cases (i.e. those where the difficulty could 
not be calibrated) were included when analysing the 
comparative facility of the questions. In these cases such 
items were assumed to be very easy and assigned an 
arbitrary difficulty of -5 logits to reflect this. The value 
of -5 was selected as it was consistent with the lowest 
difficulty scores for those items where information was 
available. Item difficulty estimates were normally distributed
(when the items whose difficulty had been fixed at
-5 logits were excluded) and therefore Analysis of Variance
(ANOVA) was used to assess for intergroup differences.
Discrimination estimates were significantly skewed
and therefore intergroup differences were compared
using a Kruskal-Wallis test.
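
As an illustration of the difficulty comparison described above, the sketch below assigns the fixed value of -5 logits to uncalibrated items and computes a one-way ANOVA F statistic across themes. It is a simplified stand-in under stated assumptions: the function names are hypothetical, uncalibrated items are represented as `None`, and the p-value (which requires the F distribution) is not computed.

```python
def impute_uncalibrated(difficulties, floor=-5.0):
    """Replace None (item answered correctly by everyone) with a fixed easy value."""
    return [floor if d is None else d for d in difficulties]

def one_way_f(*groups):
    """F statistic and degrees of freedom for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = len(groups) - 1, n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within
```

In use, each theme's list of item difficulties would first be passed through `impute_uncalibrated`, and the resulting lists supplied to `one_way_f` as separate groups.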

Power issues in Rasch analysis are a matter for debate 
with some authors suggesting that around 200 respon- 
dents are required to accurately estimate item difficulty 
whilst others suggest as few as 30 participants may be 
required in well-targeted tests (i.e. those where difficulty 
is well matched to ability) [20-22]. For this reason a 
post-hoc power exploration was performed using a 
Monte Carlo simulation study [23]. This was carried out 
in two stages according to the method described by 
Muthen and Muthen [24] as implemented in Mplus ver- 
sion 5.21 [25]. The simulation used responses from the 
smaller first cohort and was conducted over 10,000 
iterations. The results were examined for evidence of 
bias in the replicated item difficulty values [23].
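
The logic of such a simulation can be conveyed with a much-simplified sketch. It is not the Mplus procedure used in the study: it assumes abilities are known and drawn from a standard normal distribution, estimates a single item's difficulty by maximum likelihood with Newton-Raphson steps, and expresses bias as the percentage difference between the mean replicated estimate and the population value. All names and defaults (96 persons, 500 replications, the seed) are illustrative assumptions.

```python
import math
import random

def rasch_p(theta, b):
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_difficulty(responses, thetas, iterations=25):
    """Maximum-likelihood difficulty for one item, treating abilities as known."""
    score = sum(responses)
    if score == 0 or score == len(responses):
        return None                       # extreme score: no finite estimate
    b = 0.0
    for _ in range(iterations):
        probs = [rasch_p(t, b) for t in thetas]
        gradient = sum(probs) - score     # derivative of the log-likelihood in b
        information = sum(p * (1.0 - p) for p in probs)
        b += gradient / information       # Newton-Raphson step
    return b

def percent_bias(true_b, n_persons=96, replications=500, seed=1):
    """Relative bias (%) of the mean replicated difficulty estimate."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(replications):
        thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]
        responses = [1 if rng.random() < rasch_p(t, true_b) else 0
                     for t in thetas]
        b_hat = estimate_difficulty(responses, thetas)
        if b_hat is not None:             # skip replications with no finite estimate
            estimates.append(b_hat)
    mean_b = sum(estimates) / len(estimates)
    return 100.0 * (mean_b - true_b) / true_b
```

Calling `percent_bias` for a well-targeted item (for example `true_b=1.0`) and for a very easy one (for example `true_b=-3.0`) gives a feel for how targeting might affect the stability of the estimates with a cohort of this size.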

For normally distributed variables pairwise correla- 
tions and ANOVA were performed in STATA version 
10 [26]. Where the variable was observed to be non- 
normally distributed according to a significance test [27] 
then an appropriate non-parametric comparison was 
performed. 

Ethical Approval 

The SRQ data utilised by this study was routinely gath- 
ered for assessment and course monitoring purposes. 
Anonymity was maintained for all students during the 
analysis process by use of a unique identifier code. Stu- 
dents were advised that such data was being collected 
and could be used in non-identifiable form. Ethical suit- 
ability of these studies for publication was confirmed in 
writing by the Chair of the School's Ethics Committee. 
Other data used in the present analysis was collected as 
part of research that had been given ethical approval by 
the Durham University School for Health Research 
Ethics Committee. It has previously been argued that 
data collected for routine assessment purposes may be
subsequently used for research purposes as long as the
data is anonymised, and no harms can result from its 
use [28]. 

Results 

Exam Item Characteristics 

Where a trait or ability conforms to the assumption of 
unidimensionality made by the Rasch model there 
should be relatively little correlation between responses 
once the effect of the underlying dimension has been 
removed. In the present study the "Rasch factor analy- 
sis" findings generally supported this assumption in that 
the contrasts within the residuals from a Rasch Factor 
Analysis consistently explained less than approximately 
5% of the unexplained variance in item responses [16]. 
However, the Rasch factor analysis for the skills items 
completed by the second cohort suggested the presence 
of at least a second dimension indicated by the first 
contrast in the residuals explaining 7.5% of the variance. 
The item characteristics, as estimated by the Rasch ana- 
lysis, are depicted in Table 1. 

Candidates performed significantly better on Profes- 
sionalism items compared to anatomy (F = 13.44, p < 
0.001) and skills questions (F = 6.04, p = .02). In addition
the estimates of the professionalism item discrimination
parameters were significantly lower compared to
those for anatomy (F = 19.55, p < 0.001) but not skills
items (F = .14, p = .7). This implies that the professionalism
items were easier compared to the other two item
types and poorer at discriminating candidates of differ- 
ing abilities compared to the anatomy items. In terms of 
the fit of item responses to the Rasch model, responses
to anatomy items were mildly skewed towards overfitting
the model according to 'infit' (information weighted)
indices: the average z score for infit for anatomy items
was -.20 reflecting a tendency to less variation in
responses than the Rasch model would have predicted.
In contrast, the professionalism items were skewed towards
underfit with a mean z score of .39. This reflected a
trend to a slightly more erratic response pattern than 
might be expected under the assumptions of the Rasch 
model. Skills items had fit indices intermediate between 
these two former themes. Thus, anatomy item perfor- 
mance appeared to be more predictable than the 
response patterns observed for the professionalism items. 

Person reliability indices were relatively high for estimation
of ability at anatomy items: for the first cohort
the person reliability index was .82 and the person separation
value was 2.15 (for the second cohort these values
were .73 and 1.63 respectively). In contrast the person
reliability indices for professionalism and skills items
were much lower: for professionalism, person reliability
was 0.32 and separation was 0.69 for the first cohort.
For the second cohort these values were .43 and .87
respectively. For skills items person reliability was 0.42
and separation was 0.85 for the first cohort. For the second
cohort these values were .42 and .85 respectively.
This implies that both professionalism and skills items
have a limited ability to discriminate between high and
low performers on these measures.

Table 1 Item characteristics relating to the themes of professionalism, anatomy or skills from the seven exams taken
by the two cohorts attending years I and II of medical school at Durham University

                    Difficulty (SD)   Discrimination   Z Infit      Z Outfit     Guessing Index
                    Logits            (SD)             (SD)         (SD)         (SD)

Anatomy MCQs        .54 (1.2)         1.08 (.3)        -.27 (.8)    -.30 (.9)    .02 (.1)
Anatomy EMQs        -.47 (1.6)        1.07 (.2)        -.17 (.6)    -.35 (.7)    .06 (.2)
Anatomy Combined    -.16 (1.6)        1.08 (.2)        -0.2 (.7)    -.33 (.8)    .05 (.2)
Skills MCQs         -.57 (1.9)        .91 (.3)         .27 (.8)     .38 (1.0)    .11 (.3)
Skills EMQs         -.09 (1.7)        .92 (.2)         .29 (.6)     .33 (.8)     .05 (.2)
Skills Combined     -.35 (1.8)        .92 (.3)         .28 (.7)     .35 (.9)     .08 (.2)
Prof. MCQs          -.38 (2.0)        .81 (.5)         .60 (1.1)    .82 (1.1)    .19 (.3)
Prof. EMQs          -1.47 (1.8)       .94 (.2)         .29 (.5)     .34 (.7)     .07 (.2)
Prof. Combined      -1.11 (1.9)       .90 (.3)         .39 (.8)     .50 (.9)     .11 (.3)

The estimates of relative item difficulty, discrimination, standardised "infit"/"outfit" and a "guessing index" are depicted with their respective standard deviations.
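
The person reliability and separation values reported above are not independent quantities. In Rasch measurement the separation index G is conventionally related to the reliability coefficient R by G = sqrt(R / (1 - R)). The sketch below assumes that this standard relationship is what underlies the reported figures; the function names are illustrative.

```python
import math

def separation_from_reliability(reliability):
    """Person separation index G implied by a reliability coefficient R."""
    return math.sqrt(reliability / (1.0 - reliability))

def reliability_from_separation(separation):
    """Reliability coefficient R implied by a separation index G."""
    return separation ** 2 / (1.0 + separation ** 2)
```

Under this relationship a reliability of .82 corresponds to a separation of about 2.13 and a reliability of .32 to about 0.69, which agrees with the reported pairs to within rounding. A separation of 2 corresponds to a reliability of 0.8.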

Relationship between ability estimates and 
conscientiousness/professionalism 

The performance estimates derived from EMQs and 
MCQs were highly correlated. For example, ability at 
anatomy items as evaluated by performance at both 
EMQs and MCQs correlated highly with ability solely 
judged by relevant MCQs (r = 0.80) and EMQs (r = 
0.94). For this reason the performance estimates utilised 
were those derived from analysis of both SRQ formats 
for the relevant items. Performance estimates for the 
SRQs were normally distributed. However, Conscientiousness
Index scores were significantly skewed; therefore
Spearman's rank correlation test was used when comparing
this variable with others. Performance on professionalism
items was not significantly correlated with
anatomy performance (r = .12, p = .1). In contrast, pro- 
fessionalism and skills performance was modestly corre- 
lated (r = .27, p <.001) as was ability at anatomy and 
skills items (r = .35, p <.001). 
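
Because the Conscientiousness Index comparisons rely on Spearman's rank correlation, a minimal sketch of that statistic may be helpful. It computes rho as the Pearson correlation of the ranks, with tied values sharing their average rank. The function names are illustrative, and the significance test is not shown.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

Because only the ordering of scores enters the calculation, a heavily skewed variable such as the Conscientiousness Index does not distort the coefficient in the way it could distort a Pearson correlation on the raw values.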

Professionalism item performance was uncorrelated 
with the standardised Conscientiousness Index scores 
(rho = 0.009, p = 0.90). A slight non-significant trend was 
noted for performance on skills items (rho = 0.11, p = 
0.1) and Conscientiousness Index scores. In contrast 
there were modest but significant correlations between 
standardised Conscientiousness Index and performance 
on anatomy items (rho = 0.20, p = 0.006). Analysis of
variance was also used to test for differences in standardised
performance on the SRQs and Conscientiousness Index
according to peer professionalism aggregate score category
(high professionalism, low professionalism or
neither). The results are depicted in Table 2, highlighting
a number of intergroup performance differences, though
notably not on the professionalism SRQs, where differ- 
ences did not reach statistical significance (p > .1 in all 
cases). 

Findings from the Monte Carlo simulations 

The Monte Carlo simulation suggested that, in general, 
the difficulty estimates were well replicated for both the 
anatomy and the professionalism items with bias of 
around 1-2%, even when using the smaller cohort of 98 
students. However this was not true for a number of 
very easy "mistargeted" items with difficulty values of 
-3.0 logits or less (as scaled according to person ability) 
where bias was 8.6 to 110%. For the overall professionalism
items the average bias between the actual population
and simulated values was 10.9%. However when the
seven very easy items were excluded an average
bias of 1.2% was observed. Likewise, the bias between the simulated and
actual estimates of item difficulty for the anatomy items
was generally between 1 and 5% with the exception of ten
very easy items of difficulty -3 logits or less. When these
were excluded the average bias in the estimates was 
1.9%. These results implied that, with the exception of 
this small number of "mistargeted" questions, the study 
was adequately powered to estimate the item character- 
istics accurately. 

Discussion 

According to the IRT-based analysis, the psychometric 
properties of the professionalism SRQs were inferior to 
those of items relating to the testing of knowledge of 
anatomy. In particular professionalism items were rela- 
tively poor at discriminating between candidates. This is 
especially highlighted by the low person separation 
indices observed for these items; in order to reliably dis- 
criminate between two groups of candidates a person 
separation index of more than two would be required. 
In the case of the professionalism items these values 
were much less than one. The relationship between 
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Table 2 Standardised performance on Conscientiousness Index z scores and the three groups of Selected Response 
Questions (SRQs- logit z scores) according to peer rating category for students in both cohorts (N = 194) 

Peer Ratings of Conscientious. Index z Anatomy SRQ Performance Skills SRQ Performance Professionalism SRQ 

Professionalism scores* Mean (SD) Mean (SD) Performance 

High (N = 13) .83(7)** .75(1.1)* ,21(.9) § .33(1.1) 

Neither (N = 163) .01(1.0)** -.06(.9) .02(1.0) .02(1.0) 

Low(N=16) -.74(1.1)** .33(1.1) -40(.9) -.34(1.0) 

** All intergroup differences significant at the p <.01 level 

* Intergroup difference between "High" and "Neither" group significant at the p <.01 level 
§ Intergroup difference between "high" and "low" group of borderline significance at p = .07 



Conscientiousness Index scores, professionalism peer 
nominations and performance at anatomy SRQs were 
modest but statistically significant. However, no such 
relationships were observed between these former mea- 
sures and performance at professionalism items. The 
third set of skills items, relating to the application of
knowledge, were observed to have psychometric properties
somewhat intermediate between those of the professionalism
and anatomy items. Although there were no
statistically significant associations between performance 
on skills items and the ratings of conscientiousness and 
professionalism there was at least the suggestion of a 
trend. As with the professionalism items, the person 
separation indices for the skills items were relatively 
low. Taken together the characteristics of the three 
types of item may imply that the testing of applied, as 
opposed to pure, knowledge is generally less reliable 
using the SRQ format. This possibility may at least 
partly explain the poor psychometric properties of the 
professionalism items, which suggest that SRQs may not 
be an appropriate measure or predictor of professionalism,
at least for undergraduate medical students.
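
The person separation threshold mentioned above can be related to the more familiar reliability coefficient. The sketch below uses the conventional Rasch-measurement relationships between a separation index G and a reliability R; the function names are illustrative, and it is an assumption that the indices reported in this study follow the same convention.

```python
import math

def separation_to_reliability(g: float) -> float:
    """Reliability R implied by a separation index G: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

def reliability_to_separation(r: float) -> float:
    """Separation index G implied by a reliability R (0 <= R < 1)."""
    return math.sqrt(r / (1.0 - r))

def strata(g: float) -> float:
    """Approximate number of statistically distinct ability levels: (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0

# Illustrative separation values only; these are not figures from the study
for g in (0.5, 1.0, 2.0):
    r = separation_to_reliability(g)
    print(f"G = {g:.1f}: reliability = {r:.2f}, strata = {strata(g):.2f}")
```

Under this convention a separation index of two corresponds to a reliability of 0.8 and about three distinguishable ability levels, whereas any index below one corresponds to a reliability below 0.5.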

Whilst some clinical exposure occurs during the first
two years of Durham University Medical School training,
knowledge-based performance is still the main
focus of study. Therefore, conscientious study may be 
more closely allied to peer perceptions of professional- 
ism than in later stages of medical training, where more 
patient and staff interactions are observed amongst 
peers. It could be argued that performance on anatomy 
items most closely reflects this aspect of professionalism, 
given that without conscientious study it is difficult to 
perform well on this topic. However, the converse was 
not true in that those that peers perceived as least pro- 
fessional did not demonstrate a poorer performance on 
any area assessed by SRQs. This suggests that medical 
students may be relatively accurate at perceiving high 
but not low levels of conscientiousness, in contrast to 
previous findings where Conscientiousness Index was 
associated with low but not high ratings of professional- 
ism. This apparent anomaly could be due to the wider 
definition of Conscientiousness Index, which encapsulates
a range of information on behaviour, in contrast to
anatomy performance, which is restricted in scope. Thus
these two correlates of conscientiousness may be related 
to professionalism in subtly different ways. 

It is also necessary to explain why the present findings
seem to be at odds with those reported by Patterson et
al., namely that SJTs predict workplace performance by GP
trainees [13]. There are two possible explanations. Firstly,
professionalism may be developmental in nature, and 
perhaps early undergraduate medical students do not 
respond appropriately to SJTs because they have not yet 
developed appropriate situational judgement. The other,
more encouraging, explanation is that SJTs measure aspects
of professionalism different from those measured by the 
Conscientiousness Index. The strongest association we 
have found between the Conscientiousness Index and 
professionalism suggests that conscientiousness accounts 
for 25% of the variance in professionalism. While this is 
the largest single component that has been identified, at 
least to our knowledge, it leaves room for other, 
unknown, components to play significant roles, and 
there is no reason to believe that these co-vary with 
conscientiousness. The other four members of Psychol- 
ogy's 'Big Five' (extroversion, neuroticism, agreeableness 
and openness to experience [29]) would be obvious can- 
didates. Equally, Wilkinson identifies five clusters of 
measures of professionalism, one of which clearly corre- 
lates with conscientiousness, and the other four may 
well represent different aspects of professionalism [2]. In 
addition, it is possible that increased patient exposure in 
later training years may increase students' understanding
of the correct response in clinical situations and lead to 
more consistent responses to items related to profes- 
sional behaviour. 

Rasch analysis has previously been shown to be a useful
approach when exploring the psychometric properties of 
medical undergraduate exam SRQs [30]. Although not 
the focus of the present study, the findings from the 
present Rasch analysis of the exam items also suggest 
that SRQ format (e.g. EMQ versus MCQ) may influence 
their characteristics in a topic specific way. This obser- 
vation merits further research. More importantly, the 
findings from the present study should raise some concerns
regarding the use of SJTs for selection to
Foundation, as proposed by the Medical Schools Council.
As the candidates for these latter high-stakes assessments
fall between the undergraduate and postgraduate
stages, supporting evidence regarding the properties of these
tests in populations at that stage of professional devel- 
opment is urgently required. If these tests do not per- 
form adequately it may result in strong candidates 
failing to obtain one of their preferred foundation year 
posts, or in the worst case scenario, any post at all. 

Strengths and limitations 

This is the first published study to combine two dis- 
tinct indices of professionalism with SRQ performance, 
using an IRT approach. The application of IRT allowed 
an interval metric of ability to be constructed from the 
exam question responses. Moreover, the psychometric 
properties of the items could be explored more fully 
than classical test-theory would normally allow. Ideally
the ratings of professionalism in the two year groups
would both have been derived from tutor group ratings;
because they were not, some caution must be exercised in
interpreting the professionalism nominations. One of the
strengths of IRT is the ability to derive relatively
distribution-free measures of performance. It would
therefore have been desirable to use test-equating via
shared items to link absolute-SRQ ability across year 
groups rather than standardised Rasch scores, although 
the lack of shared questions precluded this. 
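
For illustration, linking through shared items could be sketched as follows. This is a minimal mean-mean linking example under the Rasch model, where two separately calibrated scales differ only by a shift; the function names and all difficulty and ability values are hypothetical, since no shared items were available in this study.

```python
from statistics import mean

def linking_constant(anchor_ref, anchor_new):
    """Shift that places the new calibration on the reference scale.

    Both arguments are difficulties (logits) of the SAME shared items,
    estimated separately in the reference and the new cohort.
    """
    if not anchor_ref or len(anchor_ref) != len(anchor_new):
        raise ValueError("anchor lists must be non-empty and of equal length")
    return mean(anchor_ref) - mean(anchor_new)

def to_reference_scale(values, shift):
    """Apply the linking shift to abilities or difficulties from the new cohort."""
    return [v + shift for v in values]

# Hypothetical shared-item difficulties from two separate calibrations
anchor_year1 = [-1.0, 0.2, 1.1]
anchor_year2 = [-0.6, 0.5, 1.6]
shift = linking_constant(anchor_year1, anchor_year2)

# Hypothetical year 2 abilities, re-expressed on the year 1 scale
year2_abilities = [-0.5, 0.4, 1.3]
linked = to_reference_scale(year2_abilities, shift)
print(f"shift = {shift:.2f}", [f"{x:.2f}" for x in linked])
```

Once linked, abilities from both cohorts sit on one logit scale and can be compared directly, which is what standardising within each cohort cannot guarantee.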

The SRQ response data utilised in this study did not 
include sociodemographic variables, such as gender and 
ethnicity. Thus, it was not possible to assess the 
response data for the presence of differential item
functioning (DIF; response bias not due to underlying
ability) according to such candidate characteristics. This
may be an important area of future research. 

The Monte Carlo simulation suggested that the item 
difficulty estimates were precise and reliable in the 
majority of cases. However, item discrimination and 
guessing parameters should ideally be evaluated via a 
full two-parameter logistic model, rather than estimated 
using the more constrained Rasch model. Thus, the 
application of IRT, whilst possible with a relatively small 
number of respondents, is more suited to larger popula- 
tion samples. 
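
The difference between the constrained and the fuller models can be made concrete with the logistic item response function. The sketch below is illustrative only: the parameter names (theta, b, a, c) follow common IRT notation rather than anything defined in this paper, and the lower asymptote c is the additional parameter of what is usually called the three-parameter logistic model.

```python
import math

def p_correct(theta: float, b: float, a: float = 1.0, c: float = 0.0) -> float:
    """Probability that a candidate of ability theta answers an item correctly.

    b: item difficulty (logits); a: discrimination; c: lower asymptote ("guessing").
    With a = 1 and c = 0 this reduces to the Rasch model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Rasch: an average candidate (theta = 0) facing a very easy item (b = -3 logits)
print(round(p_correct(0.0, -3.0), 3))         # 0.953

# Same difficulty, but a weakly discriminating item (a = 0.5) has a flatter curve
print(round(p_correct(0.0, -3.0, a=0.5), 3))  # 0.818

# A guessing floor of 0.2: a very weak candidate still succeeds about 21% of the time
print(round(p_correct(-4.0, 0.0, c=0.2), 3))  # 0.214
```

The first call also shows why items at -3 logits or below are poorly targeted: almost every candidate of average ability answers them correctly, so they carry little information about differences between candidates.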

Conclusion 

The findings of this study imply that SRQs relating to the 
theme of professional behaviour are likely to have poor 
psychometric properties and suggest that such questions
should not be routinely included in medical school 
exams. Further work could explore whether these results 
generalise to the use of SJTs in later stages of medical 
training. Efforts should be directed at developing reliable
and valid estimates of professionalism combining multiple
data sources.